A Unified Computational Lexicon for Hindi-English Code-Switching
Abstract
We investigate how the lexicons of languages in contact are merged to generate a fused lexicon for the code-mixed variety. Using the HPSG formalism, we develop computational lexicons for Hindi and English, and explore how these can be merged to obtain a fused-lect lexicon. We consider the Hindi-English Code Switching variety (HECS), a stable variety that has resulted from contact between these languages. HECS uses words and larger phrasal constituents from one language with the syntax of the other, with Hindi predominantly serving as the matrix language. The grammar developed here captures this mixing in a unified lexicon that combines pure English entries, pure Hindi entries, and cross-referenced lexical structures based on synset information for the entries. The construct of a hinge word is proposed to capture the cross-linguistic linkages, which preserve the HPSG-based head-subcategory schema of the source lexicons. The claim is that the code-switching structures in a bilingual repertoire are triggered by cross-linguistic lexical representations that unify the matrix and embedded lexicons, and that computational mechanisms for handling this mixing can be constructed using the same principles.
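As a rough illustration of the idea described in the abstract, the following Python sketch merges two toy lexicons by linking entries that share a synset and a head category. It is not the paper's HPSG implementation: the class names `LexEntry` and `HingeEntry`, the function `merge_lexicons`, the synset identifiers, the romanized Hindi forms, and the flat tuple encoding of subcategorization frames are all assumptions made for this example. Each hinge keeps both source entries intact, which is one simple way to preserve the frames of the source lexicons.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LexEntry:
    """A toy lexical entry (hypothetical structure, not a full HPSG sign)."""
    form: str      # surface form (romanized for Hindi)
    lang: str      # "hi" or "en"
    head: str      # head category, e.g. "noun", "verb"
    subcat: tuple  # simplified subcategorization frame
    synset: str    # illustrative synset identifier


@dataclass(frozen=True)
class HingeEntry:
    """Cross-referenced entry linking a Hindi and an English entry."""
    synset: str
    head: str
    hindi: LexEntry    # source entries are kept whole, so each
    english: LexEntry  # language's own frame is preserved


def merge_lexicons(hindi, english):
    """Return (pure_entries, hinges) for two monolingual lexicons.

    Entries sharing both synset and head category are linked by a hinge;
    all other entries stay in the fused lexicon as pure entries.
    """
    en_index = {}
    for e in english:
        en_index.setdefault((e.synset, e.head), []).append(e)

    hinges = []
    linked = set()
    for h in hindi:
        for e in en_index.get((h.synset, h.head), []):
            hinges.append(HingeEntry(h.synset, h.head, h, e))
            linked.add(h)
            linked.add(e)

    pure = [x for x in list(hindi) + list(english) if x not in linked]
    return pure, hinges


# Toy data: synset IDs and frames are invented for illustration only.
HINDI = [
    LexEntry("kitaab", "hi", "noun", (), "book.n.01"),
    LexEntry("khariidnaa", "hi", "verb", ("NP", "NP"), "buy.v.01"),
    LexEntry("ne", "hi", "postposition", ("NP",), "erg.marker"),
]
ENGLISH = [
    LexEntry("book", "en", "noun", (), "book.n.01"),
    LexEntry("buy", "en", "verb", ("NP", "NP"), "buy.v.01"),
    LexEntry("the", "en", "determiner", ("N",), "det.def"),
]

pure, hinges = merge_lexicons(HINDI, ENGLISH)
for hinge in hinges:
    print(hinge.synset, hinge.hindi.form, "<->", hinge.english.form)
```

In this sketch, entries with no cross-linguistic counterpart (such as a Hindi case marker or an English determiner) remain pure entries, while linked pairs become candidates for switching at that lexical position.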
Similar resources
A Hindi-English Code-Switching Corpus
The aim of this paper is to investigate the rules and constraints of code-switching (CS) in Hindi-English mixed language data. In this paper, we’ll discuss how we collected the mixed language corpus. This corpus is primarily made up of student interview speech. The speech was manually transcribed and verified by bilingual speakers of Hindi and English. The code-switching cases in the corpus are...
Computational evidence that Hindi and Urdu share a grammar but not the lexicon
Hindi and Urdu share a grammar and a basic vocabulary, but are often mutually unintelligible because they use different words in higher registers and sometimes even in quite ordinary situations. We report computational translation evidence of this unusual relationship (it differs from the usual pattern, that related languages share the advanced vocabulary and differ in the basics). We took a GF...
Word-level Language Identification in Bi-lingual Code-switched Texts
Code-switching is the practice of moving back and forth between two languages in spoken or written form of communication. In this paper, we address the problem of word-level language identification of code-switched sentences. Here, we primarily consider Hindi-English (Hinglish) code-switching, which is a popular phenomenon among urban Indian youth, though the approach is generic enough to be ex...
Bengali and Hindi to English Cross-language Text Retrieval under Limited Resources
This paper describes our experiment on two cross-lingual and one monolingual English text retrievals at CLEF in the ad-hoc track. The cross-language task includes the retrieval of English documents in response to queries in the two most widely spoken Indian languages, Hindi and Bengali. For our experiment, we had access to a Hindi-English bilingual lexicon, 'Shabdanjali', consisting of approx. 26K H...
Functions of Code-Switching in Tweets: An Annotation Scheme and Some Initial Experiments
Code-Switching (CS) is very common among multilinguals who switch between two or more languages when communicating or having a dialogue with each other. People have not constrained CS to just spoken form but also have introduced this concept to written text. Due to the popularity of social-media, people have used this platform to perform CS in the text form. This gave rise to the need of comput...